{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "J1a1kYHwrcpm"
   },
   "source": [
    "# Welcome to this Tutorial\n",
    "\n",
    "The code that follows will walk you through several things:\n",
    "\n",
    "\n",
    "1.   How to use the Dialogflow API to interact with your agent programmatically\n",
    "2.   How to get started with TensorFlow Hub modules and use one to generate embeddings for your training phrases\n",
    "3.   How to use those embeddings to detect problems in your Dialogflow agent\n",
    "\n",
    "\n",
    "<a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/training-data-analyst/blob/master/courses/dialogflow-chatbot/notebooks/IntentHealthCheck.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before starting, we need to install a few dependencies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "SNn0Sx-Ose7L"
   },
   "outputs": [],
   "source": [
    "!pip install --quiet --upgrade tensorflow dialogflow scipy tensorflow-hub"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "eV-PUfKfiF2q"
   },
   "outputs": [],
   "source": [
    "import dialogflow_v2 as dialogflow\n",
    "import tensorflow as tf\n",
    "import tensorflow_hub as hub\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from collections import defaultdict\n",
    "import os\n",
    "from google.colab import auth\n",
    "import pickle\n",
    "print(tf.__version__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "znWzuMVJiPX7"
   },
   "source": [
    "Now, let's set the variables we need. These values should match the entries you used when creating your project and service account."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "cellView": "code",
    "colab": {},
    "colab_type": "code",
    "id": "xSATtr21thaG"
   },
   "outputs": [],
   "source": [
    "PROJECT_ID=\"\" # Set your GCP Project Id\n",
    "SERVICE_ACCOUNT_EMAIL=\"\" # Set your Dialogflow service account email"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "L-Y4fC3ZixZs"
   },
   "source": [
    "#### Great!\n",
    "Now, we need to authenticate your session so you can create a key for your service account.\n",
    "\n",
    "Follow the instructions in the output section of the cell below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "JDdAAVeVtMCX"
   },
   "outputs": [],
   "source": [
    "auth.authenticate_user()\n",
    "!gcloud config set project {PROJECT_ID}\n",
    "!gcloud iam service-accounts keys create sa-key.json \\\n",
    " --iam-account={SERVICE_ACCOUNT_EMAIL} --project={PROJECT_ID}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "VhAkJYYLjRvO"
   },
   "source": [
    "#### Let's take a snapshot of your Dialogflow agent\n",
    "\n",
    "The following code issues two API calls to read the entities and the intents of your Dialogflow agent. We'll need these to perform our analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "qao-CnY2u0aV"
   },
   "outputs": [],
   "source": [
    "def fetch_intents_training_phrases(service_account_file, project):\n",
    "\n",
    "  dialogflow_entity_client = dialogflow.EntityTypesClient.from_service_account_file(service_account_file)\n",
    "  parent = dialogflow_entity_client.project_agent_path(project)\n",
    "  entities = list(dialogflow_entity_client.list_entity_types(parent))\n",
    "\n",
    "  dialogflow_intents_client = dialogflow.IntentsClient.from_service_account_file(service_account_file)\n",
    "  parent = dialogflow_intents_client.project_agent_path(project)\n",
    "  intents = list(dialogflow_intents_client.list_intents(\n",
    "    parent=parent,\n",
    "    intent_view=dialogflow.enums.IntentView.INTENT_VIEW_FULL))\n",
    "\n",
    "  entities_name_to_value = {\n",
    "    'date-time': 'tomorrow afternoon',\n",
    "    'date': 'tomorrow',\n",
    "    'date-period': 'April',\n",
    "    'time': '4:30 pm',\n",
    "    'time-period': 'afternoon',\n",
    "    'number': 'one',\n",
    "    'cardinal': 'ten',\n",
    "    'ordinal': 'tenth',\n",
    "    'number-integer': '1',\n",
    "    'number-sequence': '1 2 3',\n",
    "    'flight-number' : 'LH4234',\n",
    "    'unit-area': 'ten square feet',\n",
    "    'unit-currency': '5 dollars',\n",
    "    'unit-length': 'ten meters',\n",
    "    'unit-speed': '5 km/h',\n",
    "    'unit-volume': '2 liters',\n",
    "    'unit-weight': '5 kilos',\n",
    "    'unit-information': '5 megabytes',\n",
    "    'percentage': '10 percent',\n",
    "    'temperature': '25 degrees',\n",
    "    'duration': '5 days',\n",
    "    'age': '1 year old',\n",
    "    'currency-name': 'euros',\n",
    "    'unit-area-name': 'square meters',\n",
    "    'unit-length-name': 'meters',\n",
    "    'unit-speed-name': 'kilometer per hour',\n",
    "    'unit-volume-name': 'cubic meters',\n",
    "    'unit-weight-name': 'kilograms',\n",
    "    'unit-information-name': 'megabytes',\n",
    "    'address': '1600 Amphitheatre Pkwy, Mountain View, CA 94043',\n",
    "    'zip-code': '94122',\n",
    "    'geo-capital': 'Rome',\n",
    "    'geo-country': 'Denmark',\n",
    "    'geo-country-code': 'US',\n",
    "    'geo-city': 'Tokyo',\n",
    "    'geo-state': 'Scotland',\n",
    "    'place-attraction': 'Golden Gate Bridge',\n",
    "    'airport': 'SFO',\n",
    "    'location': '1600 Amphitheatre Pkwy, Mountain View, CA 94043',\n",
    "    'email': 'test@example.com',\n",
    "    'phone-number': '+11234567890',\n",
    "    'given-name': 'Joe',\n",
    "    'last-name': 'Smith',\n",
    "    'music-artist': 'Beatles',\n",
    "    'music-genre': 'Jazz',\n",
    "    'color': 'Blue',\n",
    "    'language': 'Japanese',\n",
    "    'any': 'flower',\n",
    "    'url': 'google.com'\n",
    "  }\n",
    "  for intent in intents:\n",
    "      entities_used = {entity.display_name \n",
    "        for entity in intent.parameters}\n",
    "\n",
    "      for entity in entities:\n",
    "          if entity.display_name in entities_used \\\n",
    "                  and entity.display_name not in entities_name_to_value:\n",
    "                  \n",
    "              entities_name_to_value[entity.display_name] = np.random.choice(\n",
    "                  np.random.choice(entity.entities).synonyms, replace=False)\n",
    "\n",
    "  intent_training_phrases = defaultdict(list)\n",
    "  for intent in intents:\n",
    "      for training_phrase in intent.training_phrases:\n",
    "\n",
    "          parts = [\n",
    "              entities_name_to_value[part.alias] \n",
    "              if part.alias in  entities_name_to_value else part.text\n",
    "              for part in training_phrase.parts\n",
    "          ]\n",
    "          intent_training_phrases[intent.display_name].append(\n",
    "              \"\".join(parts))\n",
    "      # Remove intents with no training phrases\n",
    "      if not intent_training_phrases[intent.display_name]:\n",
    "          del intent_training_phrases[intent.display_name]\n",
    "  return intent_training_phrases\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "5_H98m8dIY0d"
   },
   "outputs": [],
   "source": [
    "intent_training_phrases = fetch_intents_training_phrases(\"sa-key.json\", PROJECT_ID)\n",
    "\n",
    "for intent in intent_training_phrases:\n",
    "  print(\"{}:{}\".format(intent, intent_training_phrases[intent]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "rUMUOAIFGW8A"
   },
   "source": [
    "### Let's create some embeddings\n",
    "\n",
    "Embeddings are a projection of your data onto a space with fewer dimensions\n",
    "than the original space.\n",
    "\n",
    "It is a lossy transformation and, in this case, we transform the training phrases into points in a space with 512 dimensions.\n",
    "\n",
    "To do so, we'll use a TensorFlow Hub module: the [Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder/2).\n",
    "\n",
    "Before working on our data, let's see how it works.\n",
    "\n",
    "Please note that the following code block can take a while to complete in each new session.\n",
    "This is because Google Cloud provisions a new virtual machine for every session, so the TensorFlow Hub module has to be downloaded again.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "uQHoK4FOHP9u"
   },
   "outputs": [],
   "source": [
    "embed_module = hub.Module(\"https://tfhub.dev/google/universal-sentence-encoder/2\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "fK7VgQKZkTJ0"
   },
   "source": [
    "Now that we have the module, we can define the TensorFlow operation and construct our first embeddings.\n",
    "\n",
    "Before diving into our Dialogflow training phrases, let's analyze a small, simple synthetic example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "0XFREDe8kSMc"
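   },
   "outputs": [],
   "source": [
    "def make_embeddings_fn():\n",
    "  # Build the graph once: a placeholder for a batch of sentences and the embedding op.\n",
    "  # Note: tf.placeholder and tf.Session are TensorFlow 1.x graph-mode APIs.\n",
    "  placeholder = tf.placeholder(dtype=tf.string, shape=[None])\n",
    "  embed = embed_module(placeholder)\n",
    "  session = tf.Session()\n",
    "  session.run([tf.global_variables_initializer(), tf.tables_initializer()])\n",
    "  def _embeddings_fn(sentences):\n",
    "      # Reuse the same session so the module is initialized only once\n",
    "      computed_embeddings = session.run(\n",
    "        embed, feed_dict={placeholder: sentences})\n",
    "      return computed_embeddings\n",
    "  return _embeddings_fn\n",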
    "\n",
    "generate_embeddings = make_embeddings_fn()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "UZLqLTbwl_9f"
   },
   "outputs": [],
   "source": [
    "sentences = [\n",
    "  \"Hi\",\n",
    "  \"Hello\",\n",
    "  \"Goodbye\",\n",
    "  \"I'm a software program\"\n",
    "]\n",
    "computed_embeddings = generate_embeddings(sentences)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "9zvE35Ohk0O1"
   },
   "source": [
    "Now let's see what the embeddings look like. Specifically, they are projections of the original sentences into a `512`-dimensional space. A space with that many dimensions is impossible for a human to visualize, so let's project the embeddings onto a 2-D space that we can plot and comprehend.\n",
    "\n",
    "To achieve this, we use Principal Component Analysis (PCA). PCA is a lossy transformation that keeps as much of the variance in the data as possible for the number of dimensions retained. This lets us plot the points in a 2-D space while preserving the distances between the sentences as well as we can.\n",
    "\n",
    "Let's also print the distance between every pair of sentences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "M6rH0THPJUwg"
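   },
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "from scipy.spatial.distance import pdist, squareform\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "point_size = 200\n",
    "\n",
    "# Project the 512-dimensional embeddings onto their first two principal components\n",
    "pca = PCA(n_components=2)\n",
    "points_2d = pca.fit_transform(computed_embeddings)\n",
    "\n",
    "fig = plt.figure(figsize=(16, 10))\n",
    "ax = fig.add_subplot(111)\n",
    "\n",
    "# Give each sentence its own marker so it can be matched to the legend\n",
    "for point, marker in zip(points_2d, ['o', '^', '*', 's']):\n",
    "  xs = point[0]\n",
    "  ys = point[1]\n",
    "  ax.scatter(xs, ys, marker=marker, s=point_size)\n",
    "\n",
    "ax.set_xlabel('X Dimension')\n",
    "ax.set_ylabel('Y Dimension')\n",
    "\n",
    "ax.legend(sentences)\n",
    "plt.show()\n",
    "\n",
    "# Pairwise Euclidean distances between the original (unprojected) embeddings\n",
    "distances = squareform(pdist(computed_embeddings))\n",
    "display(pd.DataFrame(distances, index=sentences, columns=sentences))"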
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "8lHNEi5F10RB"
   },
   "source": [
    "As you can see, our method seems to work. 'Hi' and 'Hello' are similar sentences, and hence they end up closer to each other than any other pair of sentences in our simple example.\n",
    "\n",
    "Let's now use this method to analyze our Dialogflow training phrases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "vumjAEHc3d0J"
   },
   "outputs": [],
   "source": [
    "training_phrases_with_embeddings = defaultdict(list)\n",
    "for intent_name, training_phrases_list in intent_training_phrases.items():\n",
    "  computed_embeddings = generate_embeddings(training_phrases_list)\n",
    "  training_phrases_with_embeddings[intent_name] = dict(zip(training_phrases_list, computed_embeddings))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "7EzLKfUJLX3H"
   },
   "outputs": [],
   "source": [
    "for intent_name, _ in training_phrases_with_embeddings.items():\n",
    "  training_phrase, embeddings = next(iter(training_phrases_with_embeddings[intent_name].items()))\n",
    "  print(\"{}: {{'{}':{}}}\".format(intent_name, training_phrase, embeddings[:5]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "oHvhF65s5vyS"
   },
   "source": [
    "### Let's see how our training phrases are doing!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Ck0u6wp86DED"
   },
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "\n",
    "embedding_vectors = []\n",
    "\n",
    "for intent, training_phrases_and_embeddings in training_phrases_with_embeddings.items():\n",
    "  for training_phrase, embeddings in training_phrases_and_embeddings.items():\n",
    "    embedding_vectors.append(embeddings)\n",
    "\n",
    "embedding_vectors = np.asarray(embedding_vectors)\n",
    "\n",
    "pca = PCA(n_components=2)\n",
    "pca.fit(embedding_vectors)\n",
    "\n",
    "from mpl_toolkits.mplot3d import Axes3D\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "fig = plt.figure(figsize=(15,10))\n",
    "ax = fig.add_subplot(111)\n",
    "\n",
    "legend = []\n",
    "\n",
    "for color, intent in enumerate(training_phrases_with_embeddings.keys()):\n",
    "  phrases = list(training_phrases_with_embeddings[intent].keys())\n",
    "  embeddings = list(training_phrases_with_embeddings[intent].values())\n",
    "  points = pca.transform(embeddings)\n",
    "  xs = points[:,0]\n",
    "  ys = points[:,1]\n",
    "  ax.scatter(xs, ys, marker='o', s=100, c=\"C\"+str(color))\n",
    "  for i, phrase in enumerate(phrases):\n",
    "    ax.annotate(phrase[:15] + '...', (xs[i], ys[i]))\n",
    "  legend.append(intent)\n",
    "\n",
    "\n",
    "ax.legend(legend)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "yIRawh94icS7"
   },
   "source": [
    "### Cosine similarity\n",
    "[Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) is a similarity metric that works well in high-dimensional spaces. What it does is simple: given two vectors of numbers, it calculates the cosine of the angle between them.\n",
    "\n",
    "\n",
    "![formula](https://wikimedia.org/api/rest_v1/media/math/render/svg/1d94e5903f7936d3c131e040ef2c51b473dd071d)\n",
    "\n",
    "![](http://blog.christianperone.com/wp-content/uploads/2013/09/cosinesimilarityfq1.png)\n",
    "\n",
    "To calculate it, we can use the `sklearn.metrics` package instead of coding it ourselves.\n",
    "\n",
    "\n",
    "Let's compute the similarity of every possible pair of sentences, so that we can then measure how *dispersed* our intents are."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Q5VM6KqfibRT"
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics.pairwise import cosine_similarity\n",
    "\n",
    "# Flatten the nested dictionary into (intent, phrase, embedding) tuples\n",
    "flatten = []\n",
    "\n",
    "for intent in training_phrases_with_embeddings:\n",
    "  for phrase in training_phrases_with_embeddings[intent]:\n",
    "    flatten.append((intent, phrase,  training_phrases_with_embeddings[intent][phrase]))\n",
    "\n",
    "# Compute the cosine similarity of every unordered pair of training phrases\n",
    "data = []\n",
    "for i in range(len(flatten)):\n",
    "  for j in range(i+1, len(flatten)):\n",
    "\n",
    "    intent_1 = flatten[i][0]\n",
    "    phrase_1 = flatten[i][1]\n",
    "    embedd_1 = flatten[i][2]\n",
    "\n",
    "    intent_2 = flatten[j][0]\n",
    "    phrase_2 = flatten[j][1]\n",
    "    embedd_2 = flatten[j][2]\n",
    "\n",
    "    similarity = cosine_similarity([embedd_1], [embedd_2])[0][0]\n",
    "\n",
    "    record = [intent_1, phrase_1, intent_2, phrase_2, similarity]\n",
    "    data.append(record)\n",
    "\n",
    "similarity_df = pd.DataFrame(data, \n",
    "  columns=[\"Intent A\", \"Phrase A\", \"Intent B\", \"Phrase B\", \"Similarity\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "VmGZL9h8kShk"
   },
   "outputs": [],
   "source": [
    "different_intent = similarity_df['Intent A'] != similarity_df['Intent B']\n",
    "display(similarity_df[different_intent].sort_values('Similarity', ascending=False).head(5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "r9IrieApn21z"
   },
   "source": [
    "### Compute Intents Cohesion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "-2mjItlWLVP3"
   },
   "outputs": [],
   "source": [
    "# Cohesion: mean similarity between training phrases that belong to the same intent\n",
    "same_intent = similarity_df['Intent A'] == similarity_df['Intent B']\n",
    "cohesion_df = pd.DataFrame(similarity_df[same_intent].groupby('Intent A', as_index=False)['Similarity'].mean())\n",
    "cohesion_df.columns = ['Intent', 'Cohesion']\n",
    "display(cohesion_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "uLpG39zXonvy"
   },
   "outputs": [],
   "source": [
    "different_intent = similarity_df['Intent A'] != similarity_df['Intent B']\n",
    "separation_df = pd.DataFrame(similarity_df[different_intent].groupby(['Intent A', 'Intent B'], as_index=False)['Similarity'].mean())\n",
    "separation_df['Separation'] = 1 - separation_df['Similarity']\n",
    "del separation_df['Similarity']\n",
    "display(separation_df.sort_values('Separation'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "TS1krQTCzMww"
   },
   "source": [
    "## License"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "XQ01q3_azM6O"
   },
   "source": [
    "Authors:  Marcello Stiner & Khalid Salama\n",
    "\n",
    "---\n",
    "**Disclaimer**: This is not an official Google product. The sample code is provided for educational purposes.\n",
    "\n",
    "---\n",
    "\n",
    "Copyright 2019 Google LLC\n",
    "\n",
    "Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "you may not use this file except in compliance with the License.\n",
    "You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.\n",
    "\n",
    "Unless required by applicable law or agreed to in writing, software\n",
    "distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "See the License for the specific language governing permissions and\n",
    "limitations under the License.\n",
    "\n",
    "\n",
    "---\n"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "[MAKE A COPY] Dialogflow Intent health check.ipynb",
   "provenance": [],
   "version": "0.3.2"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
